Python Project Alix Petitcol - Marine Sublet Core DIA7 - A4 - ESILV - 2022.png

Facebook Python Project

Welcome to this notebook, dedicated to our Python project.

Reminder of the project : From a set of data, carry out a complete study with visualization and machine learning algorithms in order to explain the different links existing between the variables of the dataset.

Perform your visualization study on a Jupyter notebook and offer a Flask API to visualize and create one of the best prediction models you will find, where a user can choose the parameters suitable for the model.

Note I : The main purpose here is to estimate the number of comments that a Facebook post will receive in the hours following its publication. The number of comments is modeled by the "Target Variable" column in the datasets. The analysis first studies user behavior and the most prominent trends through different graphs. The Machine Learning part then performs a PCA (Principal Component Analysis) and implements several prediction algorithms using regression techniques.

Note II : [For the Prediction part] The Flask API is available in the same directory and can be run with the command python MyFlaskApp.py in your terminal.

Table of Contents

Libraries Imports and Installation

Import Libraries

Import the datasets

Prepare the datasets

1) Fast inspection of the dataset :

2) Rename the columns of all the datasets as explained in the given document

3) Verification of the datasets' structures

4) Verification of the composition of the datasets' values

There are no null values in the three datasets, which is good news because all rows are usable.
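A null-value check of this kind can be sketched with pandas as follows. The small frame and its values are made up for illustration; in the notebook the same call would presumably be applied to each of the three datasets.

```python
import pandas as pd

# Hypothetical miniature dataset (values invented for illustration);
# the notebook itself works on three full datasets.
df = pd.DataFrame({
    "Page Popularity": [634995, 634995, 42],
    "Post Length": [166, 132, 133],
    "Target Variable": [0, 0, 2],
})

# Count missing values per column, then over the whole frame.
nulls_per_column = df.isnull().sum()
total_nulls = int(nulls_per_column.sum())

print(nulls_per_column)
print("Total missing values:", total_nulls)
```

A total of zero means that no row has to be dropped or imputed before the study.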

5) Summarize the publication weekday in a single column [For the visualization part only]
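One possible way to do this is sketched below. It assumes that the weekday is stored as seven binary flag columns, one per day; the column names used here are hypothetical and may differ from those in the notebook.

```python
import pandas as pd

DAYS = ["Sunday", "Monday", "Tuesday", "Wednesday",
        "Thursday", "Friday", "Saturday"]
PREFIX = "Post Published "  # hypothetical column prefix

# Assumed layout: one binary flag column per weekday, exactly one flag set per row.
flag_cols = [PREFIX + day for day in DAYS]
df = pd.DataFrame(
    [[0, 0, 1, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0, 1],
     [1, 0, 0, 0, 0, 0, 0]],
    columns=flag_cols,
)

# idxmax(axis=1) returns, for each row, the name of the column holding the 1.
df["Post Published Weekday"] = (
    df[flag_cols].idxmax(axis=1).str.replace(PREFIX, "", regex=False)
)

print(df["Post Published Weekday"].tolist())
```

A single categorical column like this is easier to pass to plotting functions than seven separate flags.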

6) Create new datasets with only understandable columns [For the visualization part only]

7) Replace the Page Category numbers with Page Category labels [For the visualization part only]
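A minimal sketch of this replacement, assuming the categories are stored as integer codes. The codes and labels below are placeholders; the real mapping comes from the documentation supplied with the dataset.

```python
import pandas as pd

# Placeholder mapping (hypothetical codes and labels).
CATEGORY_LABELS = {1: "Category A", 2: "Category B", 3: "Category C"}

df = pd.DataFrame({"Page Category": [1, 3, 2, 99]})

# map() looks each code up in the dictionary; codes missing from the
# mapping become NaN and are then labelled "Unknown".
df["Page Category Label"] = (
    df["Page Category"].map(CATEGORY_LABELS).fillna("Unknown")
)

print(df)
```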

Visualization Study

1. General Study of the dataset composition

1) Observe the correlation between the different elements in the first dataset

Interpretation :

We can observe that, for instance, the target variable is highly correlated with CC1 (the total number of comments before the selected base date/time), CC2 (the number of comments in the last 48 to last 24 hours relative to the base date/time), CC4 (the number of comments in the first 24 hours after the publication of the post but before the base date/time) and CC5 (the difference between CC2 and CC3), which is logical.

The other main features positively correlated with the target variable are the post share count and "Page talking about", which measures the daily interest of individuals in the source of the post: the people who actually come back to the page after liking it. This includes activities such as comments, likes on a post, shares, etc. by visitors to the page.
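The ranking of features by their correlation with the target can be sketched as below. The data is a small invented sample and the column names are assumptions; the notebook's correlation graph is computed on the full first dataset.

```python
import pandas as pd

# Invented sample; column names are assumed, not taken from the notebook.
df = pd.DataFrame({
    "CC1": [0, 5, 10, 20, 40],
    "Post Share Count": [1, 3, 4, 9, 15],
    "Post Length": [300, 120, 500, 80, 250],
    "Target Variable": [0, 2, 5, 9, 21],
})

# Pearson correlation of every feature with the target, strongest first.
corr_with_target = (
    df.corr()["Target Variable"]
    .drop("Target Variable")
    .sort_values(ascending=False)
)

print(corr_with_target)
```

Sorting the target's column of the correlation matrix gives the same information as reading one row of a heatmap, in a form that is easier to scan.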

2) Observe the number of comments depending on the number of hours after publication and on the category

You need to download the notebook and execute the cells to obtain this graph.

3) Page Popularity according to the weekday

Interpretation :

There is no significant difference in page popularity between the days of the week.

2. Study of several features' impact on the target variable

2.1 : Focus Study on Page Category

Interpretation :

This is the word cloud of the most represented categories. "Sports", "teams" and "professionals" are the words that appear most frequently.

1) Focus on the Page Categories with the highest number of comments [in terms of sum of target variable]

You need to download the notebook and execute the cells to obtain this graph.

2) Focus on the Page Categories with the highest number of comments in general [in terms of mean of target variable]

You need to download the notebook and execute the cells to obtain this graph.

Interpretation :

In terms of the sum of comments, the Professional sports team page category ranks first, with a total of 94,339 comments. However, when categories are ranked by their average number of comments over the dataset, the ranking is different.
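The difference between the two rankings can be illustrated with a small sketch. The labels and counts below are invented: a category with many posts can lead on the sum while a category with few, heavily commented posts leads on the mean.

```python
import pandas as pd

# Invented sample; labels and values are hypothetical.
df = pd.DataFrame({
    "Page Category": ["Category A"] * 4 + ["Category B"] + ["Category C"] * 2,
    "Target Variable": [10, 20, 5, 15, 40, 8, 12],
})

# Total and average number of comments per category.
stats = df.groupby("Page Category")["Target Variable"].agg(["sum", "mean"])

top_by_sum = stats["sum"].idxmax()
top_by_mean = stats["mean"].idxmax()

print(stats)
print("Top by sum:", top_by_sum, "| Top by mean:", top_by_mean)
```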

3) Radar of the top 3 Categories

You need to download the notebook and execute the cells to obtain these two radar graphs.

Interpretation :

Here we can observe the combination and distribution of the different components that make up the pages of the top 3 page categories (in terms of sum or average) as studied above.

2.2 : Weekday impact

1) Visualize the distribution of page check-ins according to the weekday

Interpretation :

Saturday seems to be the day when most users are on Facebook and visit pages. It accounts for around 16% of page check-ins, i.e. 1 to 3% more than the other days.
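One plausible way to obtain such percentages is to sum the check-ins per weekday and divide by the overall total. The sketch below uses invented values and assumed column names, so its figures do not match those of the notebook.

```python
import pandas as pd

# Invented sample; column names are assumptions.
df = pd.DataFrame({
    "Weekday": ["Monday", "Saturday", "Saturday", "Wednesday", "Monday"],
    "Page Checkins": [10, 30, 20, 25, 15],
})

# Total check-ins per weekday, expressed as a share of all check-ins.
checkins = df.groupby("Weekday")["Page Checkins"].sum()
share_pct = (checkins / checkins.sum() * 100).round(1)

print(share_pct.sort_values(ascending=False))
```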

2) Visualization of the average number of comments according to weekday

Interpretation :

This graph shows that, on average, the best day to post appears to be Wednesday, since users comment more on that day than on the other days. Saturday has the lowest number of comments, with an average of 6.23 comments per post.

3) Visualization of the top "page talking about" depending on weekday and category

You need to download the notebook and execute the cells to obtain this graph.

4) Impact of the weekday on the number of post shares

Interpretation :

On the first graph, we see that, on average, the Professional sports team page category has the highest share count. This could explain why it ranks first among the page categories for the target variable.

On the second graph, the "Page talking about" statistics for the Professional sports team page category (i.e. the 1st quartile, median and 3rd quartile) are markedly higher than those of the two other page categories shown. This reinforces the hypothesis made from the first graph.

5) Influence of the weekday on the number of post shares and the number of post comments, according to the page category

You need to download the notebook and execute the cells to obtain this graph.

2.3 : Influence of Length

1) Correlation between the length of a post and its number of comments

You need to download the notebook and execute the cells to obtain this graph.

Interpretation :

We observe that the post lengths that generate the highest number of comments lie in a range between 1000 and 4800 characters. This may indicate that the shorter the post, the more readers react to it.

A long post is less likely to be "popular" in terms of the number of comments. This is probably because readers do not have time to read and react to this type of post.
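The comparison between length ranges can be sketched by binning the post length and averaging the target in each bin. The bin edges reuse the 1000 and 4800 character bounds mentioned above; the data and column names are invented for illustration.

```python
import pandas as pd

# Invented sample; column names are assumptions.
df = pd.DataFrame({
    "Post Length": [50, 400, 1200, 3000, 4500, 6000, 9000],
    "Target Variable": [1, 2, 30, 25, 35, 3, 1],
})

# Right-closed bins: (0, 1000], (1000, 4800], (4800, inf).
bins = [0, 1000, 4800, float("inf")]
labels = ["up to 1000", "1000 to 4800", "above 4800"]
df["Length Range"] = pd.cut(df["Post Length"], bins=bins, labels=labels)

# Average number of comments in each length range.
mean_comments = df.groupby("Length Range", observed=True)["Target Variable"].mean()

print(mean_comments)
```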

2.1) Impact of the length and the share count on target variable

You need to download the notebook and execute the cells to obtain this graph.

Interpretation :

In the graph above, we observe the impact of a post's length and of its distribution on Facebook (its number of shares) on the resulting number of comments. Surprisingly, a large number of posts have a high number of shares but comparatively few comments. One possible explanation is that comments may be disabled on some of these posts.

2.2) Impact of the length and the share count on target variable when removing rows with a zero target variable

You need to download the notebook and execute the cells to obtain this graph.

2.4 : Other features' influences

2) Correlation between the page popularity and its number of comments

Interpretation :

Surprisingly, the popularity of a page does not seem to influence the number of comments under a post.

3) Correlation between a post's number of shares and the target variable

You need to download the notebook and execute the cells to obtain this graph.

Interpretation :

Surprisingly, many page categories have a large number of posts with a target variable of zero.
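To quantify this observation, the share of zero-comment posts can be computed per category. This is only a sketch on invented data with hypothetical labels, not the computation behind the graph.

```python
import pandas as pd

# Invented sample; labels are hypothetical.
df = pd.DataFrame({
    "Page Category": ["A", "A", "A", "B", "B", "C"],
    "Target Variable": [0, 0, 3, 0, 5, 7],
})

# Averaging a boolean flag gives the fraction of posts with zero comments.
zero_share = (
    df["Target Variable"].eq(0)
    .groupby(df["Page Category"])
    .mean()
    .sort_values(ascending=False)
)

print(zero_share)
```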

Machine Learning

1. Principal Component Analysis

Import Libraries
Data Preparation
PCA

1) Application to the data

2) Scree plot

Interpretation :

We can see that roughly the first 23 to 26 components explain 95% of the variance. We won't need the components after the 26th one, but let's keep digging to find exactly where we reach 95%.

3) Cumsum of explained variance

Interpretation :

The PCA explains 95% of the variance with the first 25 components. This means that we can use the first 25 components without losing much information. It would also make the predictions faster because we use less data.
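The search for the number of components that reaches 95% can be sketched with scikit-learn as follows. The data is synthetic (10 features driven by 3 latent factors), so the resulting count differs from the 25 components found on the Facebook dataset; standardizing the features first is an assumption about the preparation step.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the dataset: 200 rows, 10 correlated features.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 10))

# Standardize so that no feature dominates because of its scale (assumed step).
X_scaled = StandardScaler().fit_transform(X)

# Fit a full PCA and accumulate the explained variance ratios.
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 95%.
n_components_95 = int(np.argmax(cumulative >= 0.95)) + 1

print("Cumulative explained variance:", np.round(cumulative, 3))
print("Components needed for 95%:", n_components_95)
```

scikit-learn's PCA also accepts a float such as n_components=0.95, which selects this number of components directly.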

2. Prediction

Data Preparation
SVM

Lasso
Ridge
ElasticNetCV
RandomForestRegressor
GradientBoostingRegressor
VotingClassifier
Graphics

You need to download the notebook and execute the cells to obtain these three graphs.